
feat: add local llama-cpp embedding support #1388

Merged
MaojiaSheng merged 2 commits into volcengine:main from Mijamind719:embedding_local
Apr 15, 2026

Conversation

Collaborator

@Mijamind719 Mijamind719 commented Apr 12, 2026

Description

This PR extends the openviking-server init interactive setup wizard to support a second local embedding path based on llama-cpp-python, while preserving the existing Ollama-based local setup flow.

In other words, this PR follows the "route 2" direction we discussed:

  • keep the Ollama setup path
  • add a parallel llama-cpp-python setup path
  • do not replace or remove Ollama
  • guide users to install local embedding dependencies through the optional extra dependency path

This PR builds on top of the existing local embedding runtime support already present on main; it mainly focuses on setup UX, config generation, and a config initialization deadlock fix.

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Extend openviking-server init with a new llama.cpp / GGUF local embedding path.
  • Keep the existing Ollama setup path as a separate option instead of replacing it.
  • Add a local GGUF preset for bge-small-zh-v1.5-f16.
  • Add interactive checks for llama-cpp-python, with guided installation via:
    • pip install "openviking[local-embed]"
  • Add local config generation for:
    • builtin GGUF preset
    • optional cache dir
    • optional VLM config
  • In the llama.cpp setup path, allow three VLM choices:
    • Ollama VLM
    • Cloud API VLM
    • Skip VLM
  • Add lazy imports in openviking/storage/__init__.py to avoid import-time deadlock when the native engine extension is loaded.
  • Fix re-entrant config singleton initialization in OpenVikingConfigSingleton.
  • Add and update setup wizard tests for the new local config path.

Why This Matches Route 2

This PR matches the route-2 direction because it does not collapse everything into a single Ollama-only or llama-cpp-only flow.

Instead, it keeps both local paths available:

  • Ollama remains the service-style local model option
  • llama-cpp-python becomes the direct GGUF local embedding option

That is the core product shape we agreed on for route 2.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Validation performed:

  • python3 -m py_compile openviking/storage/__init__.py openviking_cli/setup_wizard.py openviking_cli/utils/config/open_viking_config.py tests/cli/test_setup_wizard.py
  • PYTHONPATH=. ./.venv/bin/python -m pytest -q tests/cli/test_setup_wizard.py --maxfail=1
  • PYTHONPATH=. ./.venv/bin/python -m pytest -q tests/cli/test_doctor.py --maxfail=1

Known Limitations

  • This PR focuses on setup/config UX, not on reworking the embedding runtime itself.
  • The wizard can generate config even if the local dependency install step is skipped or fails; actual runtime validation still happens later during startup/doctor checks.
  • The builtin preset list is intentionally small for now and currently includes bge-small-zh-v1.5-f16 only.

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

The old PR description referred to the earlier local embedding runtime implementation. This has been updated to reflect the current scope of the branch, which is now centered on the setup wizard, route-2 local setup shape, and the config/import fixes.

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add embedding metadata validation to collection initialization

Relevant files:

  • openviking/storage/collection_schemas.py
  • openviking/storage/errors.py
  • openviking/storage/viking_vector_index_backend.py
  • openviking/storage/vikingdb_manager.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Add local llama-cpp embedding support

Relevant files:

  • openviking/models/embedder/__init__.py
  • openviking/models/embedder/base.py
  • openviking/models/embedder/local_embedders.py
  • openviking_cli/doctor.py
  • openviking_cli/utils/config/embedding_config.py
  • tests/cli/test_doctor.py
  • tests/misc/test_config_validation.py
  • tests/unit/test_local_embedder.py
  • pyproject.toml

⚡ Recommended focus areas for review

Backward Compatibility Break

The init_context_collection function now raises EmbeddingConfigurationError when the storage backend does not implement get_collection_meta, breaking existing deployments that use backends without this method. Previously, the function simply returned False when the collection already existed.

existing_meta = None
if hasattr(storage, "get_collection_meta"):
    existing_meta = await storage.get_collection_meta()

if not existing_meta:
    raise EmbeddingConfigurationError(
        "Existing collection metadata is unavailable; cannot validate embedding compatibility"
    )
Blocking Async Operations

The LocalDenseEmbedder does not override the base class async methods (embed_async, embed_batch_async). The base class default implementation may not properly offload the blocking llama-cpp operations to a thread pool, potentially starving the async event loop.

class LocalDenseEmbedder(DenseEmbedderBase):
    """Dense embedder backed by a local GGUF model via llama-cpp-python."""

    def __init__(
        self,
        model_name: str = DEFAULT_LOCAL_DENSE_MODEL,
        model_path: Optional[str] = None,
        cache_dir: Optional[str] = None,
        dimension: Optional[int] = None,
        query_instruction: Optional[str] = None,
        config: Optional[Dict[str, Any]] = None,
    ):
        runtime_config = dict(config or {})
        runtime_config.setdefault("provider", "local")
        super().__init__(model_name, runtime_config)

        self.model_spec = get_local_model_spec(model_name)
        self.model_path = model_path
        self.cache_dir = cache_dir or DEFAULT_LOCAL_MODEL_CACHE_DIR
        self.query_instruction = (
            query_instruction
            if query_instruction is not None
            else self.model_spec.query_instruction
        )
        self._dimension = dimension or self.model_spec.dimension
        if self._dimension != self.model_spec.dimension:
            raise ValueError(
                f"Local model '{model_name}' has fixed dimension {self.model_spec.dimension}, "
                f"but got dimension={self._dimension}"
            )

        self._resolved_model_path = self._resolve_model_path()
        self._llama = self._load_model()

    def _import_llama(self):
        try:
            module = importlib.import_module("llama_cpp")
        except ImportError as exc:
            raise EmbeddingConfigurationError(
                "Local embedding is enabled but 'llama-cpp-python' is not installed. "
                'Install it with: pip install "openviking[local-embed]". '
                "If you prefer a remote provider, set embedding.dense.provider explicitly in ov.conf."
            ) from exc

        llama_cls = getattr(module, "Llama", None)
        if llama_cls is None:
            raise EmbeddingConfigurationError(
                "llama_cpp.Llama is unavailable in the installed llama-cpp-python package."
            )
        return llama_cls

    def _resolve_model_path(self) -> Path:
        if self.model_path:
            resolved = Path(self.model_path).expanduser().resolve()
            if not resolved.exists():
                raise EmbeddingConfigurationError(
                    f"Local embedding model file not found: {resolved}"
                )
            return resolved

        cache_root = Path(self.cache_dir).expanduser().resolve()
        cache_root.mkdir(parents=True, exist_ok=True)
        target = get_local_model_cache_path(self.model_name, self.cache_dir)
        if target.exists():
            return target

        self._download_model(self.model_spec.download_url, target)
        return target

    def _download_model(self, url: str, target: Path) -> None:
        logger.info("Downloading local embedding model %s to %s", self.model_name, target)
        tmp_target = target.with_suffix(target.suffix + ".part")
        try:
            with requests.get(url, stream=True, timeout=(10, 300)) as response:
                response.raise_for_status()
                with tmp_target.open("wb") as fh:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            fh.write(chunk)
            os.replace(tmp_target, target)
        except Exception as exc:
            tmp_target.unlink(missing_ok=True)
            raise EmbeddingConfigurationError(
                f"Failed to download local embedding model '{self.model_name}' from {url} "
                f"to {target}: {exc}"
            ) from exc

    def _load_model(self):
        llama_cls = self._import_llama()
        try:
            return llama_cls(
                model_path=str(self._resolved_model_path),
                embedding=True,
                verbose=False,
            )
        except Exception as exc:
            raise EmbeddingConfigurationError(
                f"Failed to load GGUF embedding model from {self._resolved_model_path}: {exc}"
            ) from exc

    def _format_text(self, text: str, *, is_query: bool) -> str:
        if is_query and self.query_instruction:
            return f"{self.query_instruction}{text}"
        return text

    @staticmethod
    def _extract_embedding(payload: Any) -> List[float]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list) and data:
                item = data[0]
                if isinstance(item, dict) and "embedding" in item:
                    return list(item["embedding"])
            if "embedding" in payload:
                return list(payload["embedding"])
        raise RuntimeError("Unexpected llama-cpp-python embedding response format")

    @staticmethod
    def _extract_embeddings(payload: Any) -> List[List[float]]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list):
                vectors: List[List[float]] = []
                for item in data:
                    if not isinstance(item, dict) or "embedding" not in item:
                        raise RuntimeError(
                            "Unexpected llama-cpp-python batch embedding response format"
                        )
                    vectors.append(list(item["embedding"]))
                return vectors
        raise RuntimeError("Unexpected llama-cpp-python batch embedding response format")

    def embed(self, text: str, is_query: bool = False) -> EmbedResult:
        formatted = self._format_text(text, is_query=is_query)

        def _call() -> EmbedResult:
            payload = self._llama.create_embedding(formatted)
            return EmbedResult(dense_vector=self._extract_embedding(payload))

        try:
            result = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local embedding failed: {exc}") from exc

        estimated_tokens = self._estimate_tokens(formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return result

    def embed_batch(self, texts: List[str], is_query: bool = False) -> List[EmbedResult]:
        if not texts:
            return []

        formatted = [self._format_text(text, is_query=is_query) for text in texts]

        def _call() -> List[EmbedResult]:
            payload = self._llama.create_embedding(formatted)
            return [
                EmbedResult(dense_vector=vector) for vector in self._extract_embeddings(payload)
            ]

        try:
            results = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local batch embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local batch embedding failed: {exc}") from exc

        estimated_tokens = sum(self._estimate_tokens(text) for text in formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return results

    def get_dimension(self) -> int:
        return self._dimension

    def close(self):
        close_fn = getattr(self._llama, "close", None)
        if callable(close_fn):
            close_fn()
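One way to address this concern (a sketch with a minimal stand-in class, not OpenViking's actual base class or method signatures) is to override the async methods explicitly and offload the blocking call with `asyncio.to_thread`, so llama-cpp inference never runs on the event-loop thread:

```python
import asyncio
from typing import List

class BlockingEmbedder:
    """Stand-in for LocalDenseEmbedder's synchronous llama-cpp calls."""

    def embed(self, text: str) -> List[float]:
        # Placeholder for a blocking llama_cpp.Llama.create_embedding call.
        return [float(len(text))]

    async def embed_async(self, text: str) -> List[float]:
        # Offload the blocking call so the event loop stays responsive.
        return await asyncio.to_thread(self.embed, text)

    async def embed_batch_async(self, texts: List[str]) -> List[List[float]]:
        return await asyncio.to_thread(lambda: [self.embed(t) for t in texts])

async def main() -> List[List[float]]:
    embedder = BlockingEmbedder()
    single = await embedder.embed_async("hello")
    batch = await embedder.embed_batch_async(["a", "bb"])
    return [single] + batch

# asyncio.run(main()) -> [[5.0], [1.0], [2.0]]
```

`asyncio.to_thread` runs the callable in the default thread-pool executor, which is usually sufficient for CPU-light dispatch into a C extension that releases the GIL; a dedicated executor would be the next step if the default pool becomes a bottleneck.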

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Add retries for model downloads

Add retry logic for model downloads using the existing _run_with_retry helper to
improve resilience against transient network errors.

openviking/models/embedder/local_embedders.py [145-161]

 def _download_model(self, url: str, target: Path) -> None:
     logger.info("Downloading local embedding model %s to %s", self.model_name, target)
     tmp_target = target.with_suffix(target.suffix + ".part")
-    try:
+
+    def _download():
         with requests.get(url, stream=True, timeout=(10, 300)) as response:
             response.raise_for_status()
             with tmp_target.open("wb") as fh:
                 for chunk in response.iter_content(chunk_size=1024 * 1024):
                     if chunk:
                         fh.write(chunk)
         os.replace(tmp_target, target)
+
+    try:
+        self._run_with_retry(
+            _download,
+            logger=logger,
+            operation_name="local model download",
+        )
     except Exception as exc:
         tmp_target.unlink(missing_ok=True)
         raise EmbeddingConfigurationError(
             f"Failed to download local embedding model '{self.model_name}' from {url} "
             f"to {target}: {exc}"
         ) from exc
Suggestion importance [1-10]: 6

Why: This improves resilience against transient network errors during model downloads by reusing the existing _run_with_retry helper, making the local embedder more robust.

Impact: Low

@Mijamind719 Mijamind719 force-pushed the embedding_local branch 2 times, most recently from 6e57688 to f6ff2a0 on April 13, 2026 03:39
@jcp0578
Contributor

jcp0578 commented Apr 13, 2026

A few findings:

  1. [P1] Existing collections without metadata will hard-block the upgrade path
    After create_collection() returns False, the old "collection already exists, keep running" behavior is no longer preserved: the existing collection must now expose readable embedding metadata, otherwise EmbeddingConfigurationError is raised immediately. Collections created by older versions, or storage backends that have not yet implemented get_collection_meta(), will fail to start after an upgrade even though they worked fine before. Consider a compatibility branch or an explicit migration path for legacy collections, or at least documenting this.
    File: openviking/storage/collection_schemas.py
    Lines: 225-232

  2. [P2] The local embedder's async path still blocks the event loop
    embed_compat() prefers embed_async(), but the base-class default implementation just runs self.embed(...) synchronously. LocalDenseEmbedder does not override the async methods, so llama-cpp-python inference, along with the related local initialization path, still executes on the event-loop thread through the async call chain. For server-side concurrent embedding/ingestion workloads this means significant latency amplification and availability risk. Recommend explicitly overriding the async interface and offloading the blocking calls to a thread pool.
    File: openviking/models/embedder/local_embedders.py
    Lines: 76-89

  3. [P2] Model identity computation failures silently degrade to name-only comparison
    This branch is meant to encode the local model file's identity into the collection metadata to make the embedding compatibility check more accurate, but the current implementation wraps the entire path-resolution and hashing step in a bare except Exception: and falls back to model_identity = model with no logging at all. Once identity computation fails, different local model files degrade back to "compare by model name only", weakening exactly the incompatibility detection this change adds. Recommend at least logging a warning and catching only the expected exception types.
    File: openviking/storage/collection_schemas.py
    Lines: 146-154
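The third point can be sketched as follows. This is a hypothetical `compute_model_identity` helper, not the code in `collection_schemas.py`: it narrows the caught exceptions and logs a warning before falling back to the bare model name.

```python
# Sketch: narrow exception handling + warning instead of a silent fallback.
import hashlib
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

def compute_model_identity(model: str, model_path: Optional[str]) -> str:
    """Derive a model identity for embedding-compatibility checks.

    Hashes the local model file when possible so different GGUF files with
    the same model name are distinguishable; otherwise falls back to the
    model name, loudly.
    """
    if not model_path:
        return model
    try:
        data = Path(model_path).expanduser().resolve().read_bytes()
        digest = hashlib.sha256(data).hexdigest()[:16]
        return f"{model}@{digest}"
    except (OSError, ValueError) as exc:
        logger.warning(
            "Could not hash local model file %s; falling back to "
            "name-only identity: %s", model_path, exc
        )
        return model
```

Catching only `OSError`/`ValueError` keeps genuinely unexpected failures visible while still degrading gracefully for missing or unreadable files.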

@Mijamind719 Mijamind719 changed the title 【WIP】feat: add local llama-cpp embedding support feat: add local llama-cpp embedding support Apr 15, 2026
…mports, and config singleton deadlock fix

Made-with: Cursor
Co-authored-by: GPT-5.4 <noreply@openai.com>
@MaojiaSheng MaojiaSheng merged commit 3b0ac8a into volcengine:main Apr 15, 2026
10 of 11 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Apr 15, 2026
qin-ctx pushed a commit that referenced this pull request Apr 15, 2026
* feat: local llama.cpp embedding support, setup wizard, lazy storage imports, and config singleton deadlock fix

Made-with: Cursor
Co-authored-by: GPT-5.4 <noreply@openai.com>

* fix: remove unsupported custom local gguf setup

---------

Co-authored-by: GPT-5.4 <noreply@openai.com>
qin-ctx added a commit that referenced this pull request Apr 15, 2026